Reinforcement Learning: An Introduction: Scaling RL: The Move to Function Approximation

In the transition from tabular methods to function approximation (FA), we confront the reality that for most complex environments a lookup table is infeasible, because it needs one entry per state. When the number of states $n$ reaches astronomical scales, we must instead represent the value function as a parameterized mapping with a weight vector $\mathbf{w} \in \mathbb{R}^m$, where $m \ll n$: $v_{\pi}(s) \approx \hat{v}(s, \mathbf{w})$.
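
To make the parameterized mapping $\hat{v}(s, \mathbf{w})$ concrete, here is a minimal sketch that assumes a linear form, where the estimate is the dot product of the weights and a feature vector for the state. The corridor environment, the feature map `corridor_features`, and the hand-picked weights are all hypothetical and chosen only for illustration; the text above does not commit to any particular form of $\hat{v}$.

```python
from typing import List, Sequence


def v_hat(features: Sequence[float], w: Sequence[float]) -> float:
    """Linear approximation: dot product of the weights and the state features."""
    if len(features) != len(w):
        raise ValueError("features and weights must have the same length")
    return sum(wi * xi for wi, xi in zip(w, features))


# Hypothetical 1-D corridor with one million positions (the states).
N_STATES = 1_000_000


def corridor_features(s: int) -> List[float]:
    """Hypothetical feature map: a bias term plus the position scaled to [0, 1]."""
    return [1.0, s / (N_STATES - 1)]


# Two hand-picked weights stand in for a table with N_STATES entries.
w = [0.0, 1.0]

print(v_hat(corridor_features(0), w))             # estimate at the left end
print(v_hat(corridor_features(N_STATES - 1), w))  # estimate at the right end
```

A table for this corridor would store one million numbers, while the sketch stores two and computes every estimate on demand.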

The Feasibility Inquiry

You might ask: "Is there any reason to think this might be possible?" Can we really represent the roughly $10^{170}$ board states of Go with just a few million weights? The answer lies in the regularity of our world: similar states usually have similar values. By describing each state with a small set of features, typically normalized into a stable range like [0, 1] so that no single feature dominates learning, we allow the agent to detect patterns (like "territory control" in Go) that apply to billions of configurations it has never explicitly seen.
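
The following sketch illustrates the two ideas in this paragraph under simplifying assumptions: min-max scaling of a raw feature into [0, 1], and similar states receiving similar estimates. The single "territory" feature, the helper names, and the weights are hypothetical; a real Go agent would use far richer features.

```python
from typing import List, Sequence

BOARD_POINTS = 19 * 19  # intersections on a standard Go board


def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Scale value from [lo, hi] into [0, 1], clipping anything outside the range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))


def go_features(territory: int) -> List[float]:
    """Hypothetical feature vector: a bias term plus normalized territory count."""
    return [1.0, min_max_normalize(territory, 0, BOARD_POINTS)]


def estimate(features: Sequence[float], w: Sequence[float]) -> float:
    """Linear value estimate (an assumed form, used here for simplicity)."""
    return sum(wi * xi for wi, xi in zip(w, features))


w = [0.0, 1.0]  # hand-picked weights, not learned

# Two boards that differ by a single point of territory.
v_a = estimate(go_features(180), w)
v_b = estimate(go_features(181), w)
print(abs(v_b - v_a))  # a small gap: similar features give similar estimates
```

Because the estimate depends only on the features, every board with the same territory count gets the same value, whether or not the agent has seen that exact configuration.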

Generalization vs. Discrimination

Generalization: The superpower of FA. Learning about state A informs the estimate for a similar state B. Without it, learning cannot scale to state spaces too large to visit exhaustively.
Discrimination: The ability to tell two states apart. Tabular methods have perfect discrimination but zero generalization; FA gives up some discrimination in exchange for the ability to estimate the values of states it has never visited.
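
The trade-off between the two properties can be seen in a small sketch. It assumes a linear approximator trained with a plain stochastic-gradient step toward a target value, which is one common choice and not something the text above prescribes. The states A, B, and C, their feature vectors, and the step size are hypothetical.

```python
from typing import Dict, List, Sequence


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear value estimate for feature vector x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def tabular_update(table: Dict[str, float], s: str, target: float, alpha: float) -> None:
    """Move only the entry for state s toward the target."""
    table[s] += alpha * (target - table[s])


def linear_update(w: Sequence[float], x: Sequence[float], target: float, alpha: float) -> List[float]:
    """One gradient step; for a linear estimate the gradient is x itself."""
    error = target - dot(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


# Hypothetical features: A and B are similar; C is a different state
# that happens to share A's features exactly.
x = {"A": [1.0, 0.50], "B": [1.0, 0.52], "C": [1.0, 0.50]}

# Tabular: an update on A leaves B untouched (no generalization).
table = {"A": 0.0, "B": 0.0, "C": 0.0}
tabular_update(table, "A", target=1.0, alpha=0.5)
print(table["A"], table["B"])

# Linear FA: the same update on A also moves the estimate for B.
w = linear_update([0.0, 0.0], x["A"], target=1.0, alpha=0.5)
print(dot(w, x["A"]), dot(w, x["B"]))

# Lost discrimination: A and C are indistinguishable to the approximator.
print(dot(w, x["A"]) == dot(w, x["C"]))
```

In the tabular case the estimate for B stays at its initial value, while in the linear case it moves almost as far as the estimate for A. The price is that A and C, which share a feature vector, can never be assigned different values.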

QUESTION 1

What is the primary significance of the inequality $m \ll n$ in function approximation?

It means the agent must memorize every state to ensure accuracy.

It forces the agent to compress information and capture underlying patterns.

It refers to the learning rate being much smaller than the number of states.

QUESTION 2

In the game of Go, how does function approximation allow an agent to handle $10^{170}$ states?

By storing board configurations in a high-speed database.

By using a hash table that maps every state to a unique probability.

By extracting features like 'stone connectivity' and mapping them to a win probability.